Red Wine Quality by Yasmin Aljedawi

This report explores a dataset describing the quality of 1,599 red wines based on their chemical properties.

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The red wine dataset contains 1,599 observations with 13 variables.
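
Output like the above typically comes from a few base R inspection calls. The following is a minimal sketch rather than the original code: it uses a ten-row stand-in built from the first values printed by str(), and the name wine_head and the CSV file name in the comment are assumptions.

```r
# Ten-row stand-in built from the first values printed by str() above.
# With the full data one would instead load the file, e.g.
# read.csv("wineQualityReds.csv"); that file name is an assumption.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_head)      # number of rows and columns
str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, mean and max per column
```

On the full dataset the same three calls would report all 1,599 rows and 13 columns.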

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Wine quality appears to be roughly normally distributed, with a median of 6 and a mean of about 5.6. On a scale from 0 (very bad) to 10 (very excellent), this suggests a collection of mostly mid-range to fairly good wines.

I plotted all the chemical variables that might affect wine quality; I also wonder whether they affect one another. At a glance, several variables have positively skewed histograms (residual sugar, chlorides, free and total sulfur dioxide, and sulphates), so a data transformation may help for those.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Most red wines have a fixed acidity (tartaric acid) between 6 and 10 g/dm^3; the mean is 8.32 g/dm^3 and the median is 7.90 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most red wines have a volatile acidity (acetic acid) between 0.3 and 0.7 g/dm^3; both the mean and the median are about 0.5 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Most red wines have a citric acid level between 0.1 and 0.5 g/dm^3; citric acid adds ‘freshness’ and flavor to wines. The mean is about 0.27 g/dm^3 and the median is about 0.26 g/dm^3, which is reasonable as citric acid is usually found in small quantities.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The histogram of residual sugar is positively skewed, so I applied log and sqrt transformations to better understand its distribution. The log-transformed distribution appears close to normal, with a peak at about 2 g/dm^3. The last histogram shows the sqrt transformation, which is still long-tailed. Most red wines have a residual sugar level between 1 and 3 g/dm^3; the mean is about 2.5 g/dm^3 and the median is about 2.2 g/dm^3. Wines with less than 1 g/L are rare, and wines with more than 45 g/L are considered sweet; the maximum in this dataset is 15.5 g/dm^3.
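
To make the effect of the two transformations concrete, here is a small sketch that compares the skewness of the raw, sqrt-transformed and log-transformed values. It uses only the first ten residual.sugar values from the str() output, and skewness() is a hypothetical helper defined here, not a base R function.

```r
# First ten residual.sugar values, taken from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness (hypothetical helper, not part of base R).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw  <- skewness(residual_sugar)
skew_sqrt <- skewness(sqrt(residual_sugar))
skew_log  <- skewness(log10(residual_sugar))

# Values closer to 0 indicate a more symmetric distribution.
round(c(raw = skew_raw, sqrt = skew_sqrt, log10 = skew_log), 2)
```

On these ten values the log transformation reduces the skew the most, with the square root in between, which is consistent with the sqrt histogram still looking long-tailed.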

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The histogram of chlorides is positively skewed, so I applied a log transformation to better understand the distribution; the transformed distribution appears normal, with a peak at about 0.08 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The histograms of free and total sulfur dioxide are both long-tailed. After a log transformation, both variables appear approximately normally distributed. Free sulfur dioxide peaks at 9, 11 and 14 mg/dm^3, whereas total sulfur dioxide peaks at 40 mg/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Most red wines have a density between 0.995 and 1.0 g/cm^3; both the mean and the median are about 0.997 g/cm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Most red wines have a pH between 3.1 and 3.5; both the mean and the median are about 3.3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Most red wines have a sulphates level between 0.5 and 0.9 g/dm^3; the mean is about 0.66 g/dm^3 and the median is about 0.62 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Most red wines have an alcohol percentage between 9% and 11.5%; the mean is about 10.4% and the median is about 10.2%.

Univariate Analysis

What is the structure of your dataset?

A data frame containing 1,599 red wines described by 13 variables: an index column (X), 11 chemical attributes (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol), and one output variable (quality), rated between 0 = very bad and 10 = very excellent by at least 3 wine experts.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in my dataset are quality and alcohol. I am interested in determining which features can be used to predict wine quality. I noticed a possible correlation between wine quality and alcohol, which could be used in a predictive model for wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Citric acid, volatile acidity, sulphates, and maybe density.

Did you create any new variables from existing variables in the dataset?

Yes, I noticed that the wine dataset did not have any factors, so I created 3 new variables. The first is quality.factor, a categorical factor based on the numeric wine quality rating. The second is grade: for simplicity and less busy plots, I lumped the quality ratings into three groups representing the grade of the wine, which can be OK, Good or Very Good. The third is alcohol.intensity, a string label describing the intensity of the alcohol (Low, Medium, or High). However, it is important to note that the alcohol percentage in my sample only varies from 8.4% to 14.9%, which is not a wide enough range to label the upper end as ‘High’, since some wine rating sites consider only wines with an alcohol percentage above 15% to be ‘High’.
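
The three derived variables can be sketched with factor() and cut(). This is an illustration on the first ten rows, not the original code: the grade grouping follows the Q(3,4), Q(5,6), Q(7,8) split used later in this report, while the alcohol.intensity cut points of 10% and 12% are placeholder assumptions, since the report does not state the thresholds.

```r
# First ten alcohol and quality values, taken from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Quality rating as a categorical factor.
wine$quality.factor <- factor(wine$quality)

# Grade: ratings 3-4 -> OK, 5-6 -> Good, 7-8 -> Very Good.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
wine$grade <- cut(wine$quality,
                  breaks = c(2, 4, 6, 8),
                  labels = c("OK", "Good", "Very Good"))

# Alcohol intensity: the cut points 10 and 12 are ASSUMED for illustration.
wine$alcohol.intensity <- cut(wine$alcohol,
                              breaks = c(-Inf, 10, 12, Inf),
                              labels = c("Low", "Medium", "High"))

table(wine$grade, wine$alcohol.intensity)
```

Using cut() keeps the level order explicit (OK < Good < Very Good), which keeps plot legends and facets in a sensible order.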

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, I applied both log and sqrt transformations to the variables whose histograms were positively skewed, in order to understand their distributions better.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Quality correlates most strongly with alcohol and volatile acidity (absolute correlation coefficient > 0.3), but there also seem to be interesting correlations among the supporting variables. Free sulfur dioxide correlates highly with total sulfur dioxide, fixed acidity with both pH and density, density with both alcohol and residual sugar, and sulphates with chlorides. Let me plot a correlation matrix to get better insight.

I chose to show mainly the chemical features that may have a meaningful correlation with wine quality. From the above correlation matrix, quality correlates positively with alcohol, with a correlation coefficient of about 0.48. On the other hand, it correlates negatively with volatile acidity, with a coefficient of about -0.39. Citric acid and volatile acidity tend to correlate negatively.
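
A correlation matrix like the one above can be computed with cor(), and the pairs above a chosen threshold can be listed programmatically. The sketch below runs on the first ten rows only, so its coefficients will differ from the full-data values quoted in this report; the names wine_head and strong are assumptions.

```r
# First ten values of four columns, taken from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

corr <- cor(wine_head)  # Pearson correlation by default

# List each variable pair once, keeping those with |r| above 0.3.
idx <- which(upper.tri(corr) & abs(corr) > 0.3, arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = round(corr[idx], 3)
)
strong
```

Filtering on the absolute value matters here: a threshold on the signed coefficient would miss negative relationships such as quality against volatile acidity.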

The original plot of wine quality and alcohol looks overplotted, so I decided to plot another one with transparency and jitter effects. I also graphed the mean and the first, second (median), and third quartiles. As alcohol increases, wine quality tends to increase too. Most good-quality wines (rating of 5.5 and higher) have an alcohol concentration between 9.5% and 13%.
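
The jitter-and-transparency idea and the per-rating summary lines can be sketched in base R. This is a stand-in on the first ten rows rather than the original plotting code; the axis orientation, jitter amount and color are assumptions.

```r
# First ten alcohol and quality values, taken from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean and quartiles of alcohol for each quality rating.
stats_by_quality <- sapply(split(alcohol, quality), function(x) {
  c(mean = mean(x), quantile(x, probs = c(0.25, 0.5, 0.75)))
})
stats_by_quality

# Jitter spreads the integer ratings horizontally; a low alpha lets
# overlapping points show up as darker areas.
plot(jitter(quality, amount = 0.2), alcohol,
     pch = 16, col = adjustcolor("steelblue", alpha.f = 0.4),
     xlab = "quality", ylab = "alcohol (% by volume)")
```

The summary matrix has one column per rating, so its rows can be overlaid on the scatter plot as lines to show the trend through the noise.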

In the first plot of quality against volatile acidity, the points again fall into discrete bands because quality takes only integer values, just as in the alcohol plot. The plot suffered from overplotting, so I modified it using transparency and jitter, which shows that there’s a negative correlation between quality and volatile acidity.

As the citric acid level increases, the sulphates level tends to increase as well.

There’s an interesting negative correlation between citric acid and volatile acidity, which is clearly shown using the geom_smooth function.
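
geom_smooth overlays a fitted trend on a scatter plot. As a base R stand-in, the sketch below fits a straight line with lm() on the first ten rows; a linear fit is an assumption here, since the smoothing method used for the original plot is not stated.

```r
# First ten values of both variables, taken from the str() output above.
citric.acid      <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)

# Straight-line fit; a negative slope means volatile acidity tends to
# fall as citric acid rises.
fit   <- lm(volatile.acidity ~ citric.acid)
slope <- unname(coef(fit)["citric.acid"])
slope

# Scatter plot with the fitted line overlaid.
plot(citric.acid, volatile.acidity,
     pch = 16, col = adjustcolor("black", alpha.f = 0.5))
abline(fit, col = "red")
```

A fitted line summarizes direction only; with the full dataset a flexible smoother may reveal curvature that a straight line hides.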

As density increases, the amount of residual sugar increases as well. The geom_smooth function helped to show the positive correlation.

The best-quality wines have the highest sulphates levels. It is also interesting to see many sulphates outliers, specifically for wines with a quality rating of 5. I wonder what causes this.

Wines with the highest quality have the lowest median volatile acidity, which is what I expected, since the correlation matrix showed that quality correlates negatively with volatile acidity.

Here, I plot one of the created factors, alcohol intensity, against density. Wines with high alcohol intensity have the lowest median density. I wondered whether this also means that high-quality wines have lower density, since I noticed from the plots above that alcohol is positively correlated with wine quality. After plotting, yes, it does.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found a positive correlation between wine quality and alcohol, and a negative correlation between quality and volatile acidity.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, I found interesting relationships between some supporting variables: volatile acidity and citric acid (negative), and sulphates and citric acid (positive). Free sulfur dioxide correlates highly with total sulfur dioxide, fixed acidity with both pH and density, density with both alcohol and residual sugar, and sulphates with chlorides.

What was the strongest relationship you found?

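The quality of the wine is positively correlated with alcohol (about 0.48), which is the strongest relationship involving my main feature of interest. Alcohol itself correlates only weakly with the pH of the wine (about 0.21). On the other hand, citric acid correlates negatively with volatile acidity (about -0.55), which in turn correlates negatively with wine quality (about -0.39). Across the whole correlation matrix, the strongest pairwise relationship is between fixed acidity and pH (about -0.68).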

Multivariate Plots Section

Here I’m exploring the relationship between wine grade and alcohol combined with a third chemical feature. Since the rating.number factor still gives a busy plot, I grouped the quality numbers in pairs, where Q(3,4) is considered OK, Q(5,6) is Good, and Q(7,8) is Very Good. It looks to me that most of the high-quality wines have medium alcohol intensity and higher citric acid, but lower volatile acidity and lower density.

I created these plots to explore how alcohol interacts with other chemical variables. Wines with high alcohol intensity have low chlorides but increasing free sulfur dioxide. Most wines with high alcohol intensity have both low density and low volatile acidity. As for sulphates and residual sugar against alcohol intensity, the plot was very busy, so I changed the x and y limits and added the mean to the graph. I am not sure what I can infer from this plot; I think I reached a dead end with it.

Wines of higher quality tend to have higher medians at low pH and at high residual sugar levels, while the interesting catch is that lower-quality wines tend to have higher medians as the pH level increases. Another interesting pattern is found in the volatile acidity vs. sulphates plot: wines of higher quality tend to have lower medians in sulphates levels. The last plot shows how wine quality correlates positively with citric acid but negatively with chlorides levels.

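Wines of higher quality and higher alcohol content tend to have lower density, consistent with the negative correlations of density with both alcohol (about -0.50) and quality (about -0.17) in the correlation matrix.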

Since I noticed from the correlation matrix that there’s a correlation between fixed acidity and both density and pH, I plotted the 3 variables. As fixed acidity increases, density increases as well, but pH decreases.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High-quality wines have the highest median alcohol but lower volatile acidity. Quality also seems to correlate positively with citric acid. From the line graphs, quality also appears to correlate positively with residual sugar but negatively with pH.

Were there any interesting or surprising interactions between features?

The wine quality median pattern changes dramatically with the increase of pH levels. Higher quality wines have higher medians when pH levels are low. On the other hand, higher quality wines have lower medians when plotted against sulphates and volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.


Final Plots and Summary

Plot One

Description One

The distribution of wine quality appears to be normal with a median around 6.

Plot Two

Description Two

Wine quality correlates positively with alcohol concentration. 75% of the wines in my dataset have an alcohol concentration of 11.1% or less (the third quartile), and the maximum is 14.9%.

Plot Three

Description Three

High-quality wines appear to have a higher alcohol percentage but lower volatile acidity, as well as higher citric acid but lower chlorides.


Reflection

The red wine dataset contains 1,599 observations with 13 variables. I started by exploring each variable individually by looking at its distribution. Each observation has 11 chemical variables, an index column, and an output quality number. I converted the quality number into a categorical factor, and also created 2 other factors to represent alcohol intensity and a simplified wine grade.

Through the exploratory data analysis, I observed that wine quality was positively correlated with alcohol content (about 0.48, the strongest correlation with quality in the dataset), which was surprising as I expected alcohol to rank lower in terms of its effect on quality. Secondly, wine quality correlates negatively with volatile acidity, which is not surprising as too high a level of volatile acidity can lead to an unpleasant, vinegar taste. Thirdly, unlike volatile acidity, citric acid correlates positively with wine quality, probably because citric acid can add ‘freshness’ and flavor to wines. Lastly, wine quality correlates negatively with chlorides, which is not surprising as chlorides reflect the amount of salt in the wine.

One limitation of the dataset is the lack of fermentation information. After some research, I found that fermentation, in terms of both the time and the process used, can also affect wine quality. A predictive model for wine quality would likely be more accurate with time-series data for each wine. Another limitation is the alcohol percentage range. In my sample the alcohol percentage was not very high compared to the thresholds used by wine rating websites, which consider high alcohol intensity to be a concentration of more than 15%.

References